library(tidyverse)
library(modelr)
library(GGally)
library(ggfortify)
Q1 * You might like to think about removing some or all of
date, id, sqft_living15,
sqft_lot15 and zipcode (lat and
long provide a better measure of location in any
event).
house <- read_csv("data/kc_house_data.csv") %>% janitor::clean_names()
Rows: 21613 Columns: 21── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): id
dbl (19): price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_ab...
dttm (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(house)
[1] "id" "date" "price" "bedrooms" "bathrooms" "sqft_living" "sqft_lot"
[8] "floors" "waterfront" "view" "condition" "grade" "sqft_above" "sqft_basement"
[15] "yr_built" "yr_renovated" "zipcode" "lat" "long" "sqft_living15" "sqft_lot15"
house <- house %>% select(-c( "date","id", "sqft_lot15", "sqft_living15", "zipcode"))
names(house)
[1] "price" "bedrooms" "bathrooms" "sqft_living" "sqft_lot" "floors" "waterfront"
[8] "view" "condition" "grade" "sqft_above" "sqft_basement" "yr_built" "yr_renovated"
[15] "lat" "long"
waterfront. Should we
convert its type?house <- house %>% mutate(waterfront = as.logical(waterfront))
yr_renovated into a renovated
logical variable, indicating whether the property had ever been
renovated. You may wish to do the same.house <- house %>% mutate(renovated = if_else(yr_renovated == 0, FALSE, TRUE)) %>%
select(-yr_renovated)
view,
condition and grade? Are they interval or
categorical ordinal data types?view - An index from 0 to 4 of how good the view of the property was condition - An index from 1 to 5 on the condition of the apartment grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design
They are all categorical but grades has too many levels
I could do the same to the view and condition (use words rather than number but it's
works just as it is)
#house <- house %>% mutate(grade = case_when(grade > 10 ~ "high",
# grade > 7 ~"above_avg",
# grade > 3 ~"below_avg",
# TRUE ~ "low")) %>%
# fastDummies::dummy_cols(select_columns = c("view","grade","condition"), remove_first_dummy = TRUE, remove_selected_columns = TRUE) %>%
# mutate(across(view_1:condition_5, as.logical))
#I went back here from ggpairs,, well. I will keep them as on col.
house <- house %>% mutate(grade = case_when(grade > 10 ~ "high",
grade > 7 ~"above_avg",
grade > 3 ~"below_avg",
TRUE ~ "low"),
view = case_when(view == 0 ~ "very_bad",
view == 1 ~ "bad",
view == 2 ~ "okay",
view == 3 ~ "good",
TRUE ~ "very_good"),
condition = case_when(condition == 1 ~ "very bad",
condition == 2 ~ "bad",
condition == 3 ~ "okay",
condition == 4 ~ "good",
TRUE ~ "very_good"))
Check for aliased variables using the alias() function (this takes in a formula object and a data set). [Hint - formula price ~ . says ‘price varying with all predictors’, this is a suitable input to alias()]. Remove variables that lead to an alias. Check the ‘Elements of multiple regression’ lesson for a dropdown containing further information on finding aliased variables in a dataset.
alias(lm(price ~ ., data = house))
Model :
price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
waterfront + view + condition + grade + sqft_above + sqft_basement +
yr_built + lat + long + renovated
Complete :
(Intercept) bedrooms bathrooms sqft_living sqft_lot floors waterfrontTRUE viewgood viewokay
sqft_basement 0 0 0 1 0 0 0 0 0
viewvery_bad viewvery_good conditiongood conditionokay conditionvery bad conditionvery_good
sqft_basement 0 0 0 0 0 0
gradebelow_avg gradehigh gradelow sqft_above yr_built lat long renovatedTRUE
sqft_basement 0 0 0 -1 0 0 0 0
house <- house %>% select(-c("sqft_living", "sqft_above"))
Systematically build a regression model containing up to four main effects (remember, a main effect is just a single predictor with coefficient), testing the regression diagnostics as you go * splitting datasets into numeric and non-numeric columns might help ggpairs() run in manageable time, although you will need to add either a price or resid column to the non-numeric dataframe in order to see its correlations with the non-numeric predictors.
and the same in subsequent rounds of predictor selection with the resid column.
Remember, if you are not sure whether including a categorical predictor is statistically justified, run an anova() test passing in the models with- and without the categorical predictor and check the p-value of the test.
houses_tidy_numeric <- house %>%
select_if(is.numeric)
houses_tidy_nonnumeric <- house %>%
select_if(function(x) !is.numeric(x))
houses_tidy_nonnumeric$price <- house$price
ggpairs(houses_tidy_numeric, progress = FALSE)
ggpairs(houses_tidy_nonnumeric, progress = FALSE)
very bad one…. r2 0.2758, diagnositic plot doesn’t look too bad, under estimating everthing bit but not too crazy ?
mod1b <- lm(price ~ waterfront, data = house)
summary(mod1b)
Call:
lm(formula = price ~ waterfront, data = house)
Residuals:
Min 1Q Median 3Q Max
-1376876 -211564 -81564 108436 7168436
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 531564 2416 220.00 <2e-16 ***
waterfrontTRUE 1130312 27822 40.63 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 353900 on 21611 degrees of freedom
Multiple R-squared: 0.07095, Adjusted R-squared: 0.07091
F-statistic: 1650 on 1 and 21611 DF, p-value: < 2.2e-16
autoplot(mod1b)
very bad very bad….. r2 0.0709, diagnositic plot are bad too
mod1c <- lm(price ~ grade, data = house)
summary(mod1c)
Call:
lm(formula = price ~ grade, data = house)
Residuals:
Min 1Q Median 3Q Max
-1258635 -150632 -45632 96365 6021365
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 665392 2930 227.138 < 2e-16 ***
gradebelow_avg -284760 4006 -71.093 < 2e-16 ***
gradehigh 1013243 13282 76.289 < 2e-16 ***
gradelow -475642 145156 -3.277 0.00105 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 290300 on 21609 degrees of freedom
Multiple R-squared: 0.375, Adjusted R-squared: 0.3749
F-statistic: 4322 on 3 and 21609 DF, p-value: < 2.2e-16
autoplot(mod1c)
grade is slightly better than waterfront
add them together
numeric_resid <- houses_tidy_numeric %>%
add_residuals(mod1a) %>%
select(-c(price,bathrooms))
numeric_resid %>%
select(resid, everything()) %>%
ggpairs(aes(alpha = 0.5), progress = FALSE)
lat highest
nonnumeric_resid <- houses_tidy_nonnumeric %>%
add_residuals(mod1b) %>%
select(-c(price,grade))
nonnumeric_resid %>%
select(resid, everything()) %>%
ggpairs(aes(alpha = 0.5), progress = FALSE)
mod2a <- lm(price ~ bathrooms + lat,
data = house)
summary(mod2a)
Call:
lm(formula = price ~ bathrooms + lat, data = house)
Residuals:
Min 1Q Median 3Q Max
-1444737 -156172 -40891 89682 5863413
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -37064219 684615 -54.14 <2e-16 ***
bathrooms 246880 2590 95.31 <2e-16 ***
lat 779692 14397 54.16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 293200 on 21610 degrees of freedom
Multiple R-squared: 0.3623, Adjusted R-squared: 0.3623
F-statistic: 6139 on 2 and 21610 DF, p-value: < 2.2e-16
mod2b <- lm(price ~ bathrooms + grade,
data = house)
summary(mod2b)
Call:
lm(formula = price ~ bathrooms + grade, data = house)
Residuals:
Min 1Q Median 3Q Max
-1302789 -154247 -37212 100836 5427333
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 324491 8173 39.704 <2e-16 ***
bathrooms 135902 3060 44.409 <2e-16 ***
gradebelow_avg -175681 4554 -38.580 <2e-16 ***
gradehigh 860957 13169 65.379 <2e-16 ***
gradelow -160222 139138 -1.152 0.25
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 277900 on 21608 degrees of freedom
Multiple R-squared: 0.4273, Adjusted R-squared: 0.4272
F-statistic: 4030 on 4 and 21608 DF, p-value: < 2.2e-16
it’s better but still alot of postivie errors
anova(mod1a, mod2b)
Analysis of Variance Table
Model 1: price ~ bathrooms
Model 2: price ~ bathrooms + grade
Res.Df RSS Df Sum of Sq F Pr(>F)
1 21611 2.1096e+15
2 21608 1.6682e+15 3 4.4139e+14 1905.7 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
significant, will keep grade because I had grade in my mod1, so I have run the ggplair for non_numeric without grade, I will try add third one, Lat
mod3a <- lm(price ~ bathrooms + grade + lat,
data = house)
summary(mod3a)
Call:
lm(formula = price ~ bathrooms + grade + lat, data = house)
Residuals:
Min 1Q Median 3Q Max
-1226396 -132350 -31376 78121 5379338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -33208995 611532 -54.305 <2e-16 ***
bathrooms 142127 2870 49.529 <2e-16 ***
gradebelow_avg -154494 4284 -36.063 <2e-16 ***
gradehigh 833638 12348 67.509 <2e-16 ***
gradelow -13441 130392 -0.103 0.918
lat 704580 12848 54.839 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 260300 on 21607 degrees of freedom
Multiple R-squared: 0.4973, Adjusted R-squared: 0.4972
F-statistic: 4274 on 5 and 21607 DF, p-value: < 2.2e-16
autoplot(mod3a)
the r2 is higher again.
mod3b <- lm(price ~ bathrooms + grade + view,
data = house)
summary(mod3b)
Call:
lm(formula = price ~ bathrooms + grade + view, data = house)
Residuals:
Min 1Q Median 3Q Max
-1613984 -145203 -31951 102727 5345629
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 522214 16176 32.284 < 2e-16 ***
bathrooms 126336 2881 43.846 < 2e-16 ***
gradebelow_avg -155486 4301 -36.147 < 2e-16 ***
gradehigh 766482 12499 61.326 < 2e-16 ***
gradelow -143538 130706 -1.098 0.27214
viewgood 54984 18428 2.984 0.00285 **
viewokay -52213 16614 -3.143 0.00168 **
viewvery_bad -212614 14475 -14.688 < 2e-16 ***
viewvery_good 467270 20532 22.758 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 261000 on 21604 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.4945
F-statistic: 2644 on 8 and 21604 DF, p-value: < 2.2e-16
autoplot(mod3b)
numeric_resid <- houses_tidy_numeric%>%
add_residuals(mod2a) %>%
select(-c(price,bathrooms,lat))
numeric_resid %>%
select(resid, everything()) %>%
ggpairs(aes(alpha = 0.5), progress = FALSE)
mod4a <- lm(price ~ bathrooms + grade + lat + bathrooms:grade,
data = house)
summary(mod4a)
Call:
lm(formula = price ~ bathrooms + grade + lat + bathrooms:grade,
data = house)
Residuals:
Min 1Q Median 3Q Max
-2187131 -125569 -28053 70933 4958061
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -32747605 594442 -55.090 < 2e-16 ***
bathrooms 203804 4398 46.335 < 2e-16 ***
gradebelow_avg 130998 13247 9.889 < 2e-16 ***
gradehigh -126328 45985 -2.747 0.00602 **
gradelow 100968 146504 0.689 0.49071
lat 691629 12494 55.358 < 2e-16 ***
bathrooms:gradebelow_avg -138519 5802 -23.872 < 2e-16 ***
bathrooms:gradehigh 245586 12668 19.386 < 2e-16 ***
bathrooms:gradelow 140306 389528 0.360 0.71871
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 253000 on 21604 degrees of freedom
Multiple R-squared: 0.5254, Adjusted R-squared: 0.5252
F-statistic: 2989 on 8 and 21604 DF, p-value: < 2.2e-16
autoplot(mod4a)
mod4a:0.5254, only normal qq looks bad
mod4b <- lm(price ~ bathrooms + grade + lat + bathrooms:lat,
data = house)
summary(mod4b)
Call:
lm(formula = price ~ bathrooms + grade + lat + bathrooms:lat,
data = house)
Residuals:
Min 1Q Median 3Q Max
-1063745 -124483 -30942 72562 5289742
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4001918 1922240 -2.082 0.0374 *
bathrooms -14009902 883599 -15.855 <2e-16 ***
gradebelow_avg -155110 4259 -36.419 <2e-16 ***
gradehigh 820976 12302 66.738 <2e-16 ***
gradelow -108803 129764 -0.838 0.4018
lat 90624 40405 2.243 0.0249 *
bathrooms:lat 297487 18574 16.016 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 258800 on 21606 degrees of freedom
Multiple R-squared: 0.5032, Adjusted R-squared: 0.503
F-statistic: 3647 on 6 and 21606 DF, p-value: < 2.2e-16
autoplot(mod4b)
r2:0.5032, top two looks bad
mod4c <- lm(price ~ bathrooms + grade + lat + grade:lat,
data = house)
summary(mod4c)
Call:
lm(formula = price ~ bathrooms + grade + lat + grade:lat, data = house)
Residuals:
Min 1Q Median 3Q Max
-1097205 -125650 -34043 75353 5356006
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -39936914 947318 -42.158 < 2e-16 ***
bathrooms 141196 2862 49.335 < 2e-16 ***
gradebelow_avg 12054659 1240728 9.716 < 2e-16 ***
gradehigh -36637240 6582036 -5.566 2.63e-08 ***
gradelow 26221560 41299081 0.635 0.525
lat 846057 19913 42.487 < 2e-16 ***
gradebelow_avg:lat -256720 26088 -9.841 < 2e-16 ***
gradehigh:lat 787134 138278 5.692 1.27e-08 ***
gradelow:lat -553157 871585 -0.635 0.526
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 259500 on 21604 degrees of freedom
Multiple R-squared: 0.5006, Adjusted R-squared: 0.5004
F-statistic: 2707 on 8 and 21604 DF, p-value: < 2.2e-16
autoplot(mod4c)
r2 0.5006, top two graphs looks bad
mod4a is the best one